This work establishes low test error for gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), with margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by $\mathcal{O}(\sqrt{m})$, where $m$ denotes network width, in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD need only a network width and sample count inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but otherwise arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where the data lies in extremely well-separated groups and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner-layer weights are constrained to vary only in norm and cannot rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this contrasts with prior work, which required infinite width and a delicate dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions and other tools, in the hope that they aid future work.
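To make the setting concrete, here is a minimal hypothetical sketch (not the paper's code) of the object under study: a width-$m$ two-layer ReLU network with standard Gaussian initialization and fixed random outer signs, trained by SGD on the logistic loss, while tracking a simple normalized margin and the distance traveled from initialization, the quantity compared against the $\mathcal{O}(\sqrt{m})$ threshold above. The $1/\sqrt{m}$ output scaling and all function names are illustrative assumptions.

```python
import numpy as np

def net(X, W, a):
    """Two-layer ReLU network f(x) = a^T relu(W x) / sqrt(m)."""
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def sgd_track_margin(X, y, m=512, lr=0.5, steps=5000, seed=0):
    """SGD on the logistic loss from standard Gaussian initialization;
    returns a normalized margin and ||W - W_0||_F, the weight movement."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))        # standard initialization
    W0 = W.copy()
    a = rng.choice([-1.0, 1.0], size=m)    # fixed random outer layer
    for _ in range(steps):
        i = rng.integers(n)
        z = np.maximum(W @ X[i], 0.0)      # hidden activations
        f = (a @ z) / np.sqrt(m)
        g = -y[i] / (1.0 + np.exp(y[i] * f))   # logistic loss derivative
        W -= lr * g * np.outer(a * (z > 0), X[i]) / np.sqrt(m)
    # One simple normalization: min margin divided by the Frobenius norm.
    margin = np.min(y * net(X, W, a)) / max(np.linalg.norm(W), 1e-12)
    return margin, np.linalg.norm(W - W0)

# Illustrative use on separable synthetic data:
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = np.sign(X[:, 0] + 0.1)
margin, moved = sgd_track_margin(X, y)
```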
This work provides test error bounds for iterative fixed-point methods on linear predictors, specifically stochastic and batch mirror descent (MD) and stochastic temporal difference (TD) learning, with two core contributions: (a) a single proof technique which gives high-probability guarantees despite the absence of projections, regularization, or any equivalent, even when optima have large or infinite norm, for quadratically-bounded losses (e.g., providing a unified treatment of the squared and logistic losses); (b) locally-adapted rates which depend not on global problem structure (such as condition numbers or maximum margins), but rather on properties of low-norm predictors which may suffer some small excess test error. The proof technique is an elementary and versatile coupling argument, demonstrated here in the following settings: stochastic MD under realizability; stochastic MD for general Markov data; batch MD for general IID data; stochastic MD on heavy-tailed data (still without projections); and stochastic TD on Markov chains (all prior stochastic TD bounds hold only in expectation).
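As a concrete illustration of one of these settings, the following is a minimal hypothetical sketch (not the paper's code) of projection-free stochastic mirror descent on the logistic loss, a quadratically-bounded loss, with mirror map $\psi(w) = \frac{1}{2}\|w\|_q^2$; taking $q = 2$ recovers ordinary SGD. The step size and $q$ are illustrative choices.

```python
import numpy as np

def lq_grad(v, r):
    """Gradient of 0.5 * ||v||_r^2 for a vector v."""
    nrm = np.linalg.norm(v, ord=r)
    if nrm == 0.0:
        return np.zeros_like(v)
    return np.sign(v) * np.abs(v) ** (r - 1.0) * nrm ** (2.0 - r)

def stochastic_mirror_descent(X, y, q=3.0, lr=0.05, steps=10000, seed=0):
    """Projection-free stochastic MD on the logistic loss with mirror map
    psi(w) = 0.5 * ||w||_q^2; q = 2 recovers plain SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = q / (q - 1.0)                      # dual exponent: 1/p + 1/q = 1
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        g = -y[i] * X[i] / (1.0 + np.exp(y[i] * (w @ X[i])))  # logistic grad
        theta = lq_grad(w, q) - lr * g     # step in the dual (mirror) space
        w = lq_grad(theta, p)              # map back; note: no projection
    return w

# Illustrative use on synthetic linearly separable data:
rng = np.random.default_rng(4)
X = rng.standard_normal((300, 10))
y = np.sign(X @ rng.standard_normal(10))
w = stochastic_mirror_descent(X, y)
```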
This work studies shallow ReLU networks trained via gradient descent on binary classification data where the underlying data distribution is general and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal not only in terms of the logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true conditional distribution arbitrarily finely. The iteration, sample, and architectural complexities required by this analysis all scale naturally with a certain complexity measure of the true conditional model. Lastly, while it is not shown that early stopping is necessary, it is shown that any univariate classifier satisfying a local interpolation property is inconsistent.
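To make the calibration claim concrete, here is a minimal hypothetical sketch (not from the paper): a shallow ReLU network trained with full-batch gradient descent on the logistic loss, early-stopped on held-out logistic loss, after which the sigmoid of its outputs is compared to a known conditional probability $\eta(x)$. The architecture, stopping rule, and data model are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, a):
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def gd_early_stop(X, y, Xv, yv, m=256, lr=1.0, steps=400, seed=0):
    """Full-batch GD on the logistic loss; returns the iterate with the
    best held-out logistic loss (a simple early-stopping rule)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    best_loss, best_W = np.inf, W.copy()
    for _ in range(steps):
        f = forward(X, W, a)
        r = -y * sigmoid(-y * f)                 # d/df of log(1 + e^{-yf})
        active = (X @ W.T > 0).astype(float)     # n x m ReLU gates
        W -= lr * ((r[:, None] * active * a).T @ X) / (np.sqrt(m) * n)
        val = np.mean(np.log1p(np.exp(-yv * forward(Xv, W, a))))
        if val < best_loss:
            best_loss, best_W = val, W.copy()
    return best_W, a

# Noisy labels from a known conditional model eta(x) = sigmoid(3 * x[0]):
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 3)); Xv = rng.standard_normal((200, 3))
eta = sigmoid(3 * X[:, 0])
y = np.where(rng.random(500) < eta, 1.0, -1.0)
yv = np.where(rng.random(200) < sigmoid(3 * Xv[:, 0]), 1.0, -1.0)
W, a = gd_early_stop(X, y, Xv, yv)
calib_err = np.mean(np.abs(sigmoid(forward(X, W, a)) - eta))  # calibration gap
```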
This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized spectral complexity: their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the mnist and cifar10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and that the presented bound is sensitive to this complexity.
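As a rough illustration, the bound's key quantity can be computed from a trained network's weight matrices alone; below is a minimal hypothetical sketch of the spectral complexity (the product of spectral norms times the correction factor), with the paper's reference matrices taken to be zero and one common convention for the $(2,1)$ norm. This is a simplification for illustration, not the paper's exact formula or code.

```python
import numpy as np

def spectral_complexity(weights):
    """Product of spectral norms times the (2,1)-norm correction factor,
    with reference matrices set to zero (an illustrative simplification)."""
    spec = [np.linalg.norm(A, ord=2) for A in weights]   # spectral norms
    lipschitz = float(np.prod(spec))                     # Lipschitz constant
    # (2,1) norm here: sum of Euclidean norms of the columns of each A.
    correction = sum(
        (np.linalg.norm(A, axis=0).sum() / s) ** (2.0 / 3.0)
        for A, s in zip(weights, spec)
    ) ** 1.5
    return lipschitz * correction

# Example: three random layers of a small network.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)), rng.standard_normal((32, 32)),
          rng.standard_normal((10, 32))]
print(spectral_complexity(layers))
```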
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models (including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation) which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
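To illustrate the core routine, here is a minimal sketch of the tensor power method with deflation and random restarts for an orthogonally decomposable symmetric third-order tensor; the robust variant analyzed in the paper adds perturbation tolerances that are omitted here, so this should be read as the non-robust essentials only.

```python
import numpy as np

def tensor_apply(T, v):
    """Contract a symmetric third-order tensor: T(I, v, v)."""
    return np.einsum('ijk,j,k->i', T, v, v)

def tensor_power_method(T, n_components, n_restarts=10, n_iters=100, seed=0):
    """Orthogonal symmetric tensor decomposition via power iteration with
    deflation and random restarts (non-robust sketch)."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    T = T.copy()
    eigvals, eigvecs = [], []
    for _ in range(n_components):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            v = rng.standard_normal(d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = tensor_apply(T, v)      # power iteration step
                v /= np.linalg.norm(v)
            lam = tensor_apply(T, v) @ v    # Rayleigh-type value T(v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        eigvals.append(best_lam); eigvecs.append(best_v)
        # Deflate: subtract the recovered rank-one component.
        T -= best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return np.array(eigvals), np.array(eigvecs)

# Recover an orthogonal decomposition T = sum_c lam_c * v_c (x) v_c (x) v_c:
rng = np.random.default_rng(3)
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
lams = [5.0, 3.0, 1.5]
T = sum(l * np.einsum('i,j,k->ijk', V[:, c], V[:, c], V[:, c])
        for c, l in enumerate(lams))
est_lams, est_vecs = tensor_power_method(T, n_components=3)
```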